Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
The goal of this project is simply to demonstrate that you've become comfortable working with the data and that you are on track to create your prediction algorithm. Please submit a report on RPubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
library(stringr)
library(dplyr)
library(quanteda)
library(readtext)
library(R.utils)
library(ggplot2)
set.seed(3301)
Tasks to accomplish
Questions to consider
Download the data.
source("downloadData.R")
dataPath <- file.path("..", "data")
attach(downloadData(dataPath))
c(blogs, twitter, news, badwords)
## [1] "../data/final/en_US/en_US.blogs.txt"
## [2] "../data/final/en_US/en_US.twitter.txt"
## [3] "../data/final/en_US/en_US.news.txt"
## [4] "../data/bad-words.txt"
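downloadData.R itself is not shown in this report. A minimal sketch of what such a helper might look like, assuming it fetches and unpacks the corpora and returns their paths as a named list (the URL here is a placeholder, and the internal layout is an assumption, not the real script):

```r
# Hypothetical sketch of downloadData(); the real downloadData.R is sourced
# above but not reproduced here. URL and unpacking details are assumptions.
downloadData <- function(dataPath) {
  zipFile <- file.path(dataPath, "Coursera-SwiftKey.zip")
  if (!file.exists(zipFile)) {
    dir.create(dataPath, recursive = TRUE, showWarnings = FALSE)
    download.file("https://example.com/Coursera-SwiftKey.zip", zipFile)  # placeholder URL
    unzip(zipFile, exdir = dataPath)
  }
  # Return the paths that the rest of the report attaches and uses
  list(blogs    = file.path(dataPath, "final", "en_US", "en_US.blogs.txt"),
       twitter  = file.path(dataPath, "final", "en_US", "en_US.twitter.txt"),
       news     = file.path(dataPath, "final", "en_US", "en_US.news.txt"),
       badwords = file.path(dataPath, "bad-words.txt"))
}
```

Returning a named list is what lets `attach(downloadData(dataPath))` expose `blogs`, `twitter`, `news`, and `badwords` as variables.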
First, try processing the entire file with a from-scratch implementation:
tweets <- 0
wordsTwitter <- 0
sentencesTwitter <- 0
con <- file(twitter, "r")
while (length(oneLine <- readLines(con, 1, warn = FALSE)) > 0) {
# Count tweet
tweets <- tweets + 1
# Show the first 10 tweets
if(tweets <= 10) {
print(oneLine)
}
# Tokenize by regular expression
words <- str_split(oneLine, "\\s+")[[1]]
# To detect symbol-only tokens like ':)', initialise a flag vector
symbols <- rep(FALSE, length = length(words))
# Each token:
for(i in 1:length(words)) {
# Extract token that has only symbol string
symbols[i] <- grepl("^[^a-zA-Z0-9]+$", words[i])
# Collapse tokens that are pure numbers into '[numbers]'
if(grepl("^[0-9]+$", words[i])) {
words[i] <- "[numbers]"
}
}
# Tokens
wordsPerLine <- length(simpleWords <- words[!symbols])
# Count tokens ending with punctuation as the number of sentences
for(i in 1:length(simpleWords)){
if(grepl("[.!?]$", simpleWords[i])) {
sentencesTwitter <- sentencesTwitter + 1
}
}
wordsTwitter <- wordsTwitter + wordsPerLine
remove(simpleWords, words)
}
close(con)
tweets
wordsTwitter
sentencesTwitter
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [1] "they've decided its more fun if I don't."
## [1] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [1] "Words from a complete stranger! Made my birthday even better :)"
## [1] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## [1] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
## [1] "I'm coo... Jus at work hella tired r u ever in cali"
## [1] "The new sundrop commercial ...hehe love at first sight"
## [1] "we need to reconnect THIS WEEK"
## [1] 2360148
## [1] 29706404
## [1] 2818583
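To make the regex steps in the loop concrete, here is what they do to a single invented line (base R only; the sample text is a made-up example, not from the corpus):

```r
# Walk one invented line through the same regex steps as the loop above.
line  <- "Back in 10 mins :)"
words <- strsplit(line, "\\s+")[[1]]
# Collapse pure-number tokens into the '[numbers]' placeholder
words[grepl("^[0-9]+$", words)] <- "[numbers]"
# Drop symbol-only tokens such as ':)'
simpleWords <- words[!grepl("^[^a-zA-Z0-9]+$", words)]
simpleWords
## [1] "Back"      "in"        "[numbers]" "mins"
```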
This takes a long time to compute, so we re-implement it using the quanteda package.
Load the file using the readtext package.
tweetFile <- readtext(twitter)
corpusTwitter <- corpus(tweetFile, cache = FALSE)
summary(corpusTwitter)
## Corpus consisting of 1 document:
##
## Text Types Tokens Sentences
## en_US.twitter.txt 566951 36719658 2588551
##
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Mon Jul 30 20:36:18 2018
## Notes:
Tasks to accomplish
Tips, tricks, and hints
Loading the data in. This dataset is fairly large. We emphasize that you don’t necessarily need to load the entire dataset in to build your algorithms (see point 2 below). At least initially, you might want to use a smaller subset of the data. Reading in chunks or lines using R’s readLines or scan functions can be useful. You can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time will require the use of a file connection in R. For example, the following code could be used to read the first few lines of the English Twitter dataset:
con <- file("en_US.twitter.txt", "r")
readLines(con, 1) ## Read the first line of text
readLines(con, 1) ## Read the next line of text
readLines(con, 5) ## Read in the next 5 lines of text
close(con) ## It's important to close the connection when you are done
See the ?connections help page for more information.
Sampling. To reiterate, to build models you don’t need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, you can store the sample and not have to recreate it every time. You can use the rbinom function to “flip a biased coin” to determine whether you sample a line of text or not.
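As a sanity check on the sub-sampling rate used below: each line is kept independently with probability 0.01, so the expected sample size is about 1% of the line count (the tweet count here is taken from the output above):

```r
set.seed(3301)
# Flip one biased coin per tweet; roughly 1% land heads.
keep <- rbinom(2360148, size = 1, prob = 0.01)
sum(keep)  # should be close to the expected 23601
```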
tweets <- as.numeric(countLines(twitter))
twitterSubSampling <- file.path(dataPath, "sub-sample.twitter.txt")
if(!file.exists(twitterSubSampling)) {
subSamplingRate <- .01
flipABiasedCoin <- rbinom(tweets, size = 1, prob = subSamplingRate)
conRead <- file(twitter, "r")
conWrite <- file(twitterSubSampling, "w")
len <- 0
while (length(oneLine <- readLines(conRead, 1, warn = FALSE)) > 0) {
len <- len + 1
if(flipABiasedCoin[len] == 1) {
writeLines(oneLine, conWrite)
}
}
close(conRead)
close(conWrite)
}
subTweets <- as.numeric(countLines(twitterSubSampling))
twitterTrain <- file.path(dataPath, "train.twitter.txt")
twitterTest <- file.path(dataPath, "test.twitter.txt")
if(!file.exists(twitterTrain) || !file.exists(twitterTest)) {
trainRate <- .7
flipABiasedCoin <- rbinom(subTweets, size = 1, prob = trainRate)
conRead <- file(twitterSubSampling, "r")
conWriteTrain <- file(twitterTrain, "w")
conWriteTest <- file(twitterTest, "w")
len <- 0
while (length(oneLine <- readLines(conRead, 1, warn = FALSE)) > 0) {
len <- len + 1
if(flipABiasedCoin[len] == 1) {
writeLines(oneLine, conWriteTrain)
} else {
writeLines(oneLine, conWriteTest)
}
}
close(conRead)
close(conWriteTrain)
close(conWriteTest)
}
trainTweets <- as.numeric(countLines(twitterTrain))
trainTweets
## [1] 16497
twitterCorpus <- corpus(readtext(twitterTrain), cache = FALSE)
summary(twitterCorpus)
## Corpus consisting of 1 document:
##
## Text Types Tokens Sentences
## train.twitter.txt 27590 257031 18142
##
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Tue Jul 31 03:06:13 2018
## Notes:
profanity <- readLines(badwords)
| Field | Unit | Sample sequence | 1-gram sequence | 2-gram sequence | 3-gram sequence |
|---|---|---|---|---|---|
| Computational linguistics | word | … to be or not to be … | …, to, be, or, not, to, be, … | …, to be, be or, or not, not to, to be, … | …, to be or, be or not, or not to, not to be, … |
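The n-gram sequences in the table can be generated with a few lines of base R (a minimal sketch; quanteda's tokens_ngrams() does the same job below):

```r
# Build n-grams by pasting each word with its n-1 successors.
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = "_"),
         character(1))
}
words <- c("to", "be", "or", "not", "to", "be")
ngrams(words, 2)
## [1] "to_be"  "be_or"  "or_not" "not_to" "to_be"
ngrams(words, 3)
## [1] "to_be_or"  "be_or_not" "or_not_to" "not_to_be"
```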
twitterToken <- twitterCorpus %>%
# normalize words
tokens(remove_punct = TRUE,
remove_numbers = TRUE) %>%
# removing profanity and other words
# tokens_remove(stopwords('english')) %>%
tokens_remove(profanity)
twitterDfm <- dfm(twitterToken)
topfeatures(twitterDfm, 20)
## the to i a you and for in of is it my on that me
## 6411 5433 5024 4342 3888 3144 2738 2589 2585 2418 2133 2062 1917 1626 1462
## be at have with your
## 1278 1268 1231 1218 1190
twitterDfm %>%
textplot_wordcloud(min_count = 6,
random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
twitterTokenNoStopWord <- twitterCorpus %>%
# normalize words
tokens(remove_punct = TRUE,
remove_numbers = TRUE) %>%
# removing profanity and other words
tokens_remove(stopwords('english')) %>%
tokens_remove(profanity)
topfeatures(dfm(twitterTokenNoStopWord), 20)
## just like get good love can day thanks rt now
## 1041 847 823 758 717 655 641 630 602 594
## one great new time u know today go lol see
## 572 546 524 524 522 519 515 473 450 447
dfm(twitterTokenNoStopWord) %>%
textplot_wordcloud(min_count = 6,
random_order = FALSE,
max_words = 100,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
featuresTwitter <- textstat_frequency(dfm(twitterTokenNoStopWord), n = 80)
# Sort by reverse frequency order
featuresTwitter$feature <- with(featuresTwitter, reorder(feature, -frequency))
ggplot(featuresTwitter, aes(x = feature, y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
twitterToken2Gram <- twitterToken %>%
tokens_ngrams(n = 2)
twitterDfm2Gram <- dfm(twitterToken2Gram)
topfeatures(twitterDfm2Gram, 20)
## in_the for_the of_the on_the to_be thanks_for
## 526 489 438 349 315 309
## to_the thank_you i_love at_the if_you i_have
## 296 262 248 241 238 226
## i_am have_a going_to for_a to_see will_be
## 225 219 215 211 196 196
## is_a a_great
## 190 189
twitterDfm2Gram %>%
textplot_wordcloud(min_count = 6,
random_order = FALSE,
max_words = 100,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
featuresTwitter2Gram <- textstat_frequency(twitterDfm2Gram, n = 80)
# Sort by reverse frequency order
featuresTwitter2Gram$feature <- with(featuresTwitter2Gram, reorder(feature, -frequency))
ggplot(featuresTwitter2Gram, aes(x = feature, y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
twitterToken3Gram <- twitterToken %>%
tokens_ngrams(n = 3)
twitterDfm3Gram <- dfm(twitterToken3Gram)
topfeatures(twitterDfm3Gram, 20)
## thanks_for_the thank_you_for i_love_you
## 144 62 59
## looking_forward_to i_want_to can't_wait_to
## 59 56 54
## going_to_be one_of_the have_a_great
## 49 49 46
## for_the_follow a_lot_of is_going_to
## 43 42 41
## to_see_you i_need_to i_have_to
## 41 41 34
## let_me_know a_great_day you_have_a
## 33 32 32
## i_have_a i'm_going_to
## 31 31
twitterDfm3Gram %>%
textplot_wordcloud(#min_count = 4,
random_order = FALSE,
max_words = 50,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
featuresTwitter3Gram <- textstat_frequency(twitterDfm3Gram, 60)
# Sort by reverse frequency order
featuresTwitter3Gram$feature <- with(featuresTwitter3Gram, reorder(feature, -frequency))
ggplot(featuresTwitter3Gram, aes(x = feature, y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
featuresTwitterFull <- textstat_frequency(dfm(twitterTokenNoStopWord))
summary(featuresTwitterFull)
## feature frequency rank docfreq
## Length:20798 Min. : 1.000 Min. : 1 Min. :1
## Class :character 1st Qu.: 1.000 1st Qu.: 5200 1st Qu.:1
## Mode :character Median : 1.000 Median :10400 Median :1
## Mean : 5.548 Mean :10400 Mean :1
## 3rd Qu.: 3.000 3rd Qu.:15599 3rd Qu.:1
## Max. :1041.000 Max. :20798 Max. :1
## group
## Length:20798
## Class :character
## Mode :character
##
##
##
quantile(featuresTwitterFull$frequency, c(0, .1, .5, .9, 1))
## 0% 10% 50% 90% 100%
## 1 1 1 8 1041
featuresTwitter2GramFull <- textstat_frequency(twitterDfm2Gram)
summary(featuresTwitter2GramFull)
## feature frequency rank docfreq
## Length:119887 Min. : 1.000 Min. : 1 Min. :1
## Class :character 1st Qu.: 1.000 1st Qu.: 29972 1st Qu.:1
## Mode :character Median : 1.000 Median : 59944 Median :1
## Mean : 1.703 Mean : 59944 Mean :1
## 3rd Qu.: 1.000 3rd Qu.: 89916 3rd Qu.:1
## Max. :526.000 Max. :119887 Max. :1
## group
## Length:119887
## Class :character
## Mode :character
##
##
##
quantile(featuresTwitter2GramFull$frequency, c(0, .1, .5, .9, 1))
## 0% 10% 50% 90% 100%
## 1 1 1 2 526
featuresTwitter3GramFull <- textstat_frequency(twitterDfm3Gram)
summary(featuresTwitter3GramFull)
## feature frequency rank docfreq
## Length:185306 Min. : 1.000 Min. : 1 Min. :1
## Class :character 1st Qu.: 1.000 1st Qu.: 46327 1st Qu.:1
## Mode :character Median : 1.000 Median : 92654 Median :1
## Mean : 1.102 Mean : 92654 Mean :1
## 3rd Qu.: 1.000 3rd Qu.:138980 3rd Qu.:1
## Max. :144.000 Max. :185306 Max. :1
## group
## Length:185306
## Class :character
## Mode :character
##
##
##
quantile(featuresTwitter3GramFull$frequency, c(0, .1, .5, .9, 1))
## 0% 10% 50% 90% 100%
## 1 1 1 1 144
The frequency distributions appear to follow Zipf’s law: most n-grams occur only once, while a small number of features account for the bulk of the counts.
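Zipf's law says frequency is roughly proportional to 1/rank, so plotting log-frequency against log-rank should give an approximately straight line with slope near -1. A self-contained illustration on an idealized Zipf sample (not the Twitter data):

```r
ranks <- 1:1000
freqs <- 1000 / ranks            # idealized Zipf: frequency ∝ 1 / rank
fit <- lm(log(freqs) ~ log(ranks))
coef(fit)[["log(ranks)"]]        # slope is -1 for the idealized case
## [1] -1
```

The same regression can be run on the rank and frequency columns of featuresTwitterFull; a slope reasonably close to -1 would support the observation above.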
simpleGoodTuring <- function(r, Nr, sd = 1.96) { # sd = 1.65
# number of words
N <- sum(r * Nr)
d <- diff(r)
## Turing estimate
# turing estimate index
ti <- which(d == 1)
# discount coefficients of Turing estimate
dct <- numeric(length(r))
dct[ti] <- (r[ti] + 1) / r[ti] * c(Nr[-1], 0)[ti] / Nr[ti]
## Linear Good-Turing estimate
Zr <- Nr / c(1, 0.5 * (d[-1] + d[-length(d)]), d[length(d)])
f <- lsfit(log(r), log(Zr))
coef <- f$coef
# corrected term frequency
rc <- r * (1 + 1 / r)^(1 + coef[2])
# discount coefficients of Linear Good-Turing estimate
dclgt <- rc / r
## make switch from Turing to LGT estimates
# standard deviation of term frequencies between 'r' and 'rc' (?)
rsd <- rep(1,length(r))
rsd[ti] <- (seq_len(length(r))[ti] + 1) / Nr[ti] * sqrt(Nr[ti + 1] * (1 + Nr[ti + 1] / Nr[ti]))
dc <- dct
for (i in 1:length(r)) {
if (abs(dct[i] - dclgt[i]) * r[i] / rsd[i] <= sd) {
dc[i:length(dc)] <- dclgt[i:length(dc)]
break
}
}
## renormalize the probabilities for observed objects
# summation of probabilities
sump <- sum(dc * r * Nr) / N
# renormalized discount coefficients
dcr <- (1 - Nr[1] / N) * dc / sump
# term frequency
tf <- c(Nr[1] / N, r * dcr)
p <- c(Nr[1] / N, r * dcr / N)
names(p) <- names(tf) <- c(0, r)
list(p = p, r = tf)
}
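The starting point of the function above is the basic Turing estimate: an n-gram observed r times gets the adjusted count r* = (r + 1) N(r+1) / N(r), where N(r) is the number of distinct n-grams seen exactly r times. A quick check of that formula with a hypothetical frequency-of-frequencies table:

```r
# Hypothetical N(r): how many distinct n-grams occur exactly r times.
Nr <- c(`1` = 120, `2` = 40, `3` = 24, `4` = 13)
r  <- 1:3
# Turing-adjusted counts r* = (r + 1) * N(r + 1) / N(r)
rStar <- (r + 1) * Nr[as.character(r + 1)] / Nr[as.character(r)]
round(rStar, 3)
##     2     3     4 
## 0.667 1.800 2.167
```

Note that singletons (r = 1) are discounted below 1; the probability mass freed up this way is what Good-Turing reserves for unseen n-grams.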
sgtFactory <- function() {
NrTbl1 <- textstat_frequency(twitterDfm) %>%
select(frequency) %>%
mutate(freqOfFrequency = 1) %>%
group_by(frequency) %>%
summarise_all(sum)
SGT1 <- simpleGoodTuring(NrTbl1$frequency, NrTbl1$freqOfFrequency)
NrTbl2 <- textstat_frequency(twitterDfm2Gram) %>%
select(frequency) %>%
mutate(freqOfFrequency = 1) %>%
group_by(frequency) %>%
summarise_all(sum)
SGT2 <- simpleGoodTuring(NrTbl2$frequency, NrTbl2$freqOfFrequency)
NrTbl3 <- textstat_frequency(twitterDfm3Gram) %>%
select(frequency) %>%
mutate(freqOfFrequency = 1) %>%
group_by(frequency) %>%
summarise_all(sum)
SGT3 <- simpleGoodTuring(NrTbl3$frequency, NrTbl3$freqOfFrequency)
c(
dUnigram = function(freq) {
SGT1$p[as.character(freq)]
},
dBigram = function(freq) {
SGT2$r[as.character(freq)] / freq
},
dTrigram = function(freq) {
SGT3$r[as.character(freq)] / freq
}
)
}
SGT <- sgtFactory()
nextWords <- function(input, outputs = 3, k = 0) {
# k is the least important of the parameters. It is usually chosen to be 0.
# However, empirical testing may find better values for k.
inputs <- str_split(input, "\\s+")[[1]]
inputsSize <- length(inputs)
if (inputsSize > 1) {
preTriGram <- paste(inputs[inputsSize - 1],
inputs[inputsSize],
sep = "_")
nextWordDfm <- dfm(tokens_select(twitterToken3Gram,
phrase(paste0(
preTriGram, "_*"
))))
} else {
if (inputs == "") { return() }
nextWordDfm <- NULL
}
preBiGram <- inputs[inputsSize]
# extract n-gram that starts with input
featuresNextWord <- NULL
if (length(nextWordDfm) > k) {
prevWordDfm <- dfm(tokens_select(twitterToken2Gram,
phrase(preTriGram)))
prevWordFreq <- textstat_frequency(prevWordDfm)$frequency
# data frame
featuresNextWord <-
textstat_frequency(nextWordDfm) %>%
mutate(p_bo = SGT$dTrigram(frequency) * frequency / prevWordFreq)
# human readable outputs
featuresNextWord$feature <-
sapply(as.vector(featuresNextWord$feature),
function(x) {
str_split(x, "_")[[1]][3]
})
# Sort by reverse frequency order
featuresNextWord$feature <-
with(featuresNextWord,
reorder(feature,-p_bo))
} else {
nextWordDfm <- dfm(tokens_select(twitterToken2Gram,
phrase(paste0(
preBiGram, "_*"
))))
if (length(nextWordDfm) > k) {
prevWordDfm <- dfm(tokens_select(twitterToken,
phrase(preBiGram)))
prevWordFreq <- textstat_frequency(prevWordDfm)$frequency
# data frame
featuresNextWord <-
textstat_frequency(nextWordDfm) %>%
mutate(p_bo = SGT$dBigram(frequency) * frequency / prevWordFreq)
# human readable outputs
featuresNextWord$feature <-
sapply(as.vector(featuresNextWord$feature),
function(x) {
str_split(x, "_")[[1]][2]
})
# Sort by reverse frequency order
featuresNextWord$feature <-
with(featuresNextWord,
reorder(feature,-p_bo))
} else {
nextWordDfm <- twitterDfm
featuresNextWord <-
textstat_frequency(nextWordDfm) %>%
mutate(p_bo = SGT$dUnigram(frequency))
# Sort by reverse frequency order
featuresNextWord$feature <-
with(featuresNextWord,
reorder(feature,-p_bo))
}
}
featuresNextWord %>% slice(1:outputs)
}
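The backoff logic in nextWords() can be illustrated with toy counts (hypothetical numbers, and a single constant discount standing in for the Good-Turing discounts above): use trigram evidence when it exists, and only fall back to lower-order counts when it does not.

```r
# Predict the word after "to be" from hypothetical trigram counts.
trigramCounts <- c(to_be_or = 5, to_be_the = 3)
bigramCount   <- 12                       # count of the context "to_be"
d <- 0.8                                  # assumed constant discount
# Discounted probability P_bo(w | "to be") = d * count("to_be_w") / count("to_be")
pBo <- d * trigramCounts / bigramCount
round(sort(pBo, decreasing = TRUE), 3)
##  to_be_or to_be_the 
##     0.333     0.200
```

nextWords() does the same thing with real counts, except that the discount varies with frequency via the Good-Turing functions from sgtFactory(), and an unmatched context triggers backoff to bigram and then unigram tables.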
I went to be
ggplot(nextWords("I went to be"), aes(x = feature, y = p_bo)) +
geom_bar(stat = "identity") +
xlab("Next word") + ylab("P_bo")
ggplot(nextWords("aaaaa bbbbb zzzzz"), aes(x = feature, y = p_bo)) +
geom_bar(stat = "identity") +
xlab("Next word") + ylab("P_bo")